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Abstract 

This  paper  develops  a  new  approximate  dynamic  programming  algorithm  for  Markov  decision 
problems  and  applies  it  to  a  vehicle  dispatching  problem  arising  in  warehouse  management. 

The  algorithm  is  of  the  actor-critic  type  and  uses  a  least  squares  temporal  difference  learning 
method.  It  operates  on  a  sample-path  of  the  system  and  optimizes  the  policy  within  a  pre¬ 
specified  class  parameterized  by  a  parsimonious  set  of  parameters.  The  method  is  applicable 
to  a  partially  observable  Markov  decision  process  setting  where  the  measurements  of  state 
variables  are  potentially  corrupted  and  the  cost  is  only  observed  through  the  imperfect  state 
observations.  We  show  that  under  reasonable  assumptions,  the  algorithm  converges  to  a  locally 
optimal  parameter  set.  We  also  show  that  the  imperfect  cost  observations  do  not  affect  the 
policy  and  the  algorithm  minimizes  the  true  expected  cost. 

In  the  warehouse  application,  the  problem  is  to  dispatch  sensor-equipped  forklifts  in  order 
to  minimize  operating  costs  involving  product  movement  delays  and  forklift  maintenance.  We 
consider  instances  where  standard  dynamic  programming  is  computationally  intractable.  Simu¬ 
lation  results  confirm  the  theoretical  claims  of  the  paper  and  show  that  our  algorithm  converges 
more  smoothly  than  earlier  actor-critic  algorithms  while  substantially  outperforming  heuristics 
used  in  practice. 

Keywords:  Markov  decision  processes;  partial  observability,  approximate  dynamic  program¬ 
ming,  actor-critic  algorithms,  warehouse  management,  vehicle  routing. 

1  Introduction 

Markov  Decision  Processes  (MDPs)  are  widely  used  for  modeling  sequential  decision-making  prob¬ 
lems  that  arise  in  engineering,  economics,  computer  science,  and  social  sciences.  However,  it  is 
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well-known  that  many  such  problems  suffer  from  the  curse  of  dimensionality,  which  makes  it  dif¬ 
ficult  to  obtain  practical  solutions  for  realistically-sized  instances.  In  some  cases,  the  system  of 
interest  is  so  complex  that  it  is  not  even  feasible  to  determine  all  the  MDP  model  parameters 
explicitly.  To  overcome  the  difficulty,  earlier  work  has  focused  on  simulation-based  approximate 
dynamic  programming  (see,  [33,  17]),  treatments  of  which  have  appeared  under  the  terms  reinforce¬ 
ment  learning  [39]  and  neuro-dynamic  programming  [10].  Actor-critic  algorithms  ([4]),  which  we 
adopt  in  this  work,  fall  within  this  broad  category. 

Actor-critic  algorithms  are  typically  used  to  optimize  some  Randomized  Stationary  Policy  (RSP) 
using  policy  gradient  estimation.  RSPs  are  parameterized  by  a  parsimonious  set  of  parameters 
and  the  objective  is  to  optimize  the  policy  (hence,  the  use  of  gradient  methods)  with  respect  to 
these  parameters.  To  that  end,  one  needs  to  estimate  appropriate  policy  gradients,  which  can  be 
done  using  learning  methods  and  is  much  more  efficient  than  estimating/computing  a  cost-to-go 
function  in  the  entire  state-action  space.  Many  different  versions  of  actor-critic  algorithms  have 
been  proposed,  e.g.,  [11,  12,  18,  21,  27,  29,  35],  and  they  have  been  shown  to  be  quite  efficient  for 
various  applications  (e.g.,  in  robotics  [31]  and  robot  navigation  [37],  power  management  of  wireless 
transmitters  [7],  biology  [23],  and  optimal  bidding  for  electricity  generation  companies  [20]). 

A  particularly  attractive  design  of  the  actor-critic  architecture  was  proposed  in  [25],  where  the 
critic  estimates  the  policy  gradient  using  sequential  observations  from  a  sample  path  while  the 
actor  updates  the  policy  at  the  same  time,  although  at  a  slower  time-scale.  It  was  proved  that  the 
estimate  of  the  critic  tracks  the  slowly  varying  policy  asymptotically  under  suitable  conditions.  A 
center  piece  of  these  conditions  is  a  relationship  between  the  actor  step-size  and  the  critic  step-size, 
which  will  be  discussed  later. 

The  critic  of  [25]  uses  first-order  variants  of  the  Temporal  Difference  (TD)  algorithm  (TD(1) 
and  TD(A)).  However,  it  has  been  shown  that  the  least  squares  TD  methods  (LSTD  and  LSPE)  are 
superior  in  terms  of  convergence  rate;  see  [9,  13,  15,  24,  28].  LSTD  and  LSPE  were  first  proposed 
for  discounted  cost  problems  in  [15]  and  [8],  respectively.  Later,  [24]  showed  that  the  convergence 
rate  of  LSTD  is  optimal.  More  recently,  [42]  proved  the  convergence  of  both  LSTD  and  LSPE  for 
the  average  cost  problem  and  showed  that  both  LSTD  and  LSPE  enjoy  the  optimal  convergence 
rate.  Their  results  clearly  demonstrated  that  LSTD  and  LSPE  converge  much  faster  and  more 
reliably  than  TD(1)  and  TD(A). 

In  this  paper,  we  focus  on  average  cost  problems  with  finite  state  and  control  space  (see  [34] 
for  an  introduction  on  how  to  solve  average  cost/reward  problems).  Motivated  by  these  findings, 
we  propose  an  actor-critic  algorithm  that  adopts  the  least  squares  learning  methods  while,  at  the 
same  time,  maintains  the  concurrent  update  architecture  of  the  actor  and  the  critic.  Note  that 
[32]  also  used  LSTD  in  an  actor-critic  method  mostly  applicable  to  episodic  problems,  but  the 
actor  there  had  to  wait  for  the  critic  to  converge  before  making  each  policy  update.  Further,  [32] 
provided  an  approach  to  take  advantage  of  the  freedom  of  augmenting  the  set  of  basis  functions 
for  Q- value  approximation;  something  which  is  not  considered  in  the  present  paper.  An  interesting 
future  direction  could  be  the  integration  of  the  work  in  this  paper  with  elements  of  [32] . 

In  addition  to  the  use  of  LSTD,  we  demonstrate  how  the  proposed  algorithm  can  be  modified  to 
apply  to  a  Partially  Observable  Markov  Decision  Process  (POMDP)  setting.  Such  a  setting  is  rather 
natural  in  the  particular  warehouse  management  application  we  study.  Partial  observability  in 
reinforcement  learning  algorithms  has  been  studied  before  (see,  for  example,  [26,  22,  5,  6,  1,  38,  40]). 
[40]  first  proposed  the  use  of  actor-critic  algorithms  for  POMDPs  with  perfectly  observable  cost 


and  showed  that  the  gradient  estimation  can  be  done  by  making  the  critic  compute  a  value  function 
(conditional  mean  of  the  true  value  function)  which  does  not  depend  on  the  states.  We  built  on  this 
result  and  consider  the  case  where  the  cost  is  also  not  perfectly  observable.  We  show  that  under 
some  separability  assumptions,  the  corresponding  actor-critic  algorithm  converges  to  the  locally 
optimal  solution  despite  the  imperfect  cost  information. 

The  methodological  work  in  this  paper  is  motivated  by  and  applied  to  the  problem  of  intelli¬ 
gently  dispatching  forklifts  in  a  warehouse.  We  are  interested  in  sensor-equipped  forklifts  that  can 
determine  their  status  (condition  monitoring),  their  physical  location  in  the  warehouse,  and  con¬ 
tinuously  transmit  this  information  wirelessly  to  some  central  gateway.  Leveraging  the  information 
made  available,  we  design  a  forklift  dispatching  algorithm  to  minimize  operational  costs  associated 
with  product  delivery  delays  and  forklift  maintenance.  In  this  application,  the  POMDP  setting  is 
appropriate  as  the  forklift  localization  and  condition  monitoring  system  is  subject  to  estimation 
errors. 

In  what  follows,  formulation  of  the  MDP  problem  is  in  Section  2.  The  LSTD  actor-critic 
algorithm  with  concurrent  updates  is  presented  in  Section  3.  In  Section  4,  we  propose  the  LSTD 
Partially  Observable  (LSTD-PO)  actor-critic  algorithm.  The  warehouse  management  application 
is  then  discussed  in  Section  5,  where  the  effectiveness  of  the  proposed  algorithm  is  demonstrated. 
Conclusions  are  drawn  in  Section  6. 

Notation:  We  use  bold  letters  to  denote  vectors  and  matrices;  typically  vectors  are  lower  case  and 
matrices  upper  case.  Vectors  are  assumed  to  be  column  vectors  unless  explicitly  stated  otherwise. 
For  economy  of  space  we  write  x  =  (x\, . . .  ,xn)  for  the  n-dimensional  column  vector  x  £  Mn. 
Transpose  is  denoted  by  prime.  For  any  m  x  n  matrix  A,  with  rows  ai, . . . ,  a m  £  Mn,  v(A)  will 
denote  the  column  vector  (ai, . . . ,  am).  ||  •  ||  will  denote  the  Euclidean  norm  and  ||  •  ||#  a  special  norm 
in  the  MDP  state-action  space  we  will  introduce.  0  denotes  a  vector  or  matrix  with  all  components 
set  to  zero  and  I  is  the  identity  matrix. 


2  Problem  formulation 

The  problem  considered  here  is  rather  standard  and  we  will  state  it  briefly. 

2.1  Average-cost  MDP 

Consider  an  average-cost  MDP  with  finite  state  and  action  spaces.  Let  k  denote  time,  X  denote 
the  state  space,  and  U  denote  the  action  space.  Consider  x  €  X  and  u  £  U.  Let  g(x,  u)  be  the  one- 
step  cost.  The  policy  candidates  are  assumed  to  belong  to  a  parameterized  family  of  Randomized 
Stationary  Policies  (RSPs)  {ng{u |x)  |  6  £  Mn}.  That  is,  given  a  state  x  and  a  parameter  0,  the 
policy  applies  action  u  with  probability  /j,g(u\x).  The  goal  is  to  optimize  the  average  cost  over  the 
n-dimensional  vector  6. 

Let  p(£|x,  it)  denote  the  state  transition  probabilities  (which  are  typically  not  explicitly  known); 
that  is,  p(£|x,  it)  is  the  probability  of  transitioning  to  state  £  given  that  action  u  is  taken  while  the 
system  is  at  state  x.  Under  suitable  ergodicity  conditions,  {x/,.}  and  {xk,Uk}  are  Markov  chains 
with  stationary  distributions  under  a  fixed  policy.  These  stationary  distributions  are  denoted  by 
7Te(x)  and  i]g(x,u),  respectively.  The  average  cost  of  the  MDP  is  thus  well  defined  and  given  by 


a(G)  =  u  ??0(xi  «)<?(x,  it).  We  will  not  elaborate  on  the  ergodicity  conditions,  except  to  note 
that  it  suffices  that  the  process  {x/,.}  is  irreducible  and  aperiodic  given  any  6 ,  and  for  any  x  G  X 
and  u  G  U, 


either 


inf  //e(tt|x)  >0  or  ^e(u|x)  =  0 

0 


(1) 


(see  Theorem  2.6  of  [24]).  One  possible  form  of  RSP  that  satisfies  this  assumption  uses  the 
Boltzmann-like  distribution: 


M  6>Hx) 


x(x,  ^)eft(“.x-0) 
EaeuX^wJe^0’*1®)’ 


(2) 


where  x(x,  u)  is  nonnegative.  Note  that  it  is  possible  to  assign  zero  to  x(x> u )  for  some  pairs 
of  (x,  u),  i.e.,  exclude  certain  actions  for  some  state  values.  The  Boltzmann  policy  offers  a  nice 
balance  of  flexibility  and  simplicity.  Indeed,  it  turns  out  to  be  sufficient  for  the  warehouse  vehicle 
dispatching  problem  discussed  later  in  the  paper.  The  RSP  can  be  expensive  to  compute  when  the 
action  space  is  large,  but  in  typical  applications  including  the  warehouse  case,  the  action  space  is 
relatively  small. 


2.2  Actor-Critic  algorithm 

With  no  explicit  model  of  the  state  transitions  but  only  a  sample  path  denoted  by  {xk,Uk},  the 
actor-critic  algorithms  typically  optimize  0  locally  in  the  following  way:  first,  the  critic  estimates 
the  policy  gradient 

da(6)  da(6) 

-deT'-'-der* 

using  some  type  of  a  Temporal  Difference  (TD)  algorithm;  then  the  actor  modifies  the  policy 
parameter  along  the  gradient  direction. 

Define  the  so-called  Qe-value  function  as  any  function  satisfying  the  Poisson  equation 


Qe(x,  u)  =  g(x,  u)  -  a(0)  +  (PgQg)(x,  u ), 

where  the  operator  Pg  denotes  taking  expectation  after  one  transition,  namely,  ( PgQ){x,u )  = 
Sgex  uev  lx'  u)Q(£i  u)-  Qe(x,  u)  can  be  interpreted  as  the  expected  future  excess  cost 

(on  top  of  the  average  cost)  we  incur  if  we  start  at  state  x,  apply  control  u.  and  then  follow  the 
RSP  fig-  Let  now 

•*MX,  u)  =  {lpg{x,  «),...,  1^g{x,  u)), 

where 

d 

^g(x,u)  =  —  In ng(u\x),  i  =  l,...,n. 

It  is  assumed  that  ij)g(x,u)  is  bounded  and  continuously  differentiable,  which  is  true  for  policies 
of  the  form  (2).  We  make  the  convention  '0e(x,  u)  =  0  when  /J,g(u |x)  =  0  for  all  values  of  6  (cf. 
Assumption  (1)). 

A  key  fact  underlying  the  actor-critic  algorithm  is  that  the  policy  gradient  can  be  expressed  as 
(see  [25]  for  more  details) 

da(0)  j  .  _ 

. ) n  {Qe,  yO/Oi  *  1,  .  .  .  ,  U, 


where  the  operator  (•,  -)g  is  defined  as:  for  any  two  functions  /i  and  /2  of  x  and  u, 

(/ij  $2)0  =  ^  /?e(x,u)/i(x,u)/2(x,u).  (4) 

X,1i 

This  is  very  interesting  because  it  suggests  an  equivalent  class  of  the  Qe-value  function  in  terms 
of  estimating  the  policy  gradient.  For  any  function  /(x,  u),  let  f  be  its  vector  representation  in 
Rlxllul  —  the  notations  Q  and  tpg  (used  next)  are  instances  of  this  vector  representation.  Then, 
the  operator  (4)  defines  a  inner  product  in  M.lxl lul .  Let  ||  •  ||#  denote  the  norm  induced  by  this  inner 
product,  i.e.,  ||f|||  =  (f,  i)e  =  (/,  f)e-  Let  also  S^e  be  the  subspace  of  RlxHul  spanned  by  the  vectors 
00,  i  =  1,  •  •  •  ,n.  Next,  denote  by  II©  the  projection  with  respect  to  the  norm  ||  •  ||g  onto  ^ 0 ,  i.e., 
for  any  f  G  Rlxllul,  n^f  is  the  unique  vector  in  that  minimizes  ||f  —  f||g  over  all  f  €  Since 
for  all  i 

(Qe^^e  =  (n  oQo,iPo)oi 

it  is  sufficient  to  know  the  projection  of  Q0  onto  in  order  to  compute  Va(6).  Consequently, 
for  policy  gradient  estimation,  one  may  substitute  Qq  with  a  parametric  linear  architecture  of  the 
following  form  (see  [25]): 

Qg(x,u)  =  00(x,u)r*,  r*  G  Rn.  (5) 

This  dramatically  reduces  the  complexity  of  learning  from  the  space  RlxHul  to  the  space  Rn.  Fur¬ 
thermore,  the  temporal  difference  algorithms  can  be  used  to  learn  such  an  r*  effectively.  The 
elements  of  00(x,  u)  are  understood  as  basis  functions  used  to  develop  an  approximation  of  the 
policy  gradient.  To  sum  up  the  key  point  of  this  section:  the  policy  gradient  of  the  average  cost 
MDP  can  be  expressed  as 

=  <^«(x,«)r*,0J)e.  (6) 

The  actor-critic  algorithm  aims  at  finding  good  approximations  of  r* ,  and  consequently  of  the 
policy  gradient. 


3  Actor-Critic  algorithm  using  LSTD 

The  critic  in  [25]  used  either  TD(A)  or  TD(1).  The  algorithm  we  propose  uses  a  least  squares  TD 
method  (LSTD)  instead,  as  it  has  been  shown  to  provide  far  superior  performance  ([42]). 

The  actor  and  the  critic  updates  take  place  in  the  course  of  a  simulation  of  a  single  sample  path 
of  the  MDP.  Let  x*,  denote  the  state  of  the  system  at  time  k.  Let  r^,  the  iterate  for  r*  in  (5),  be 
the  parameter  vector  of  the  critic  at  time  k,  6 j,  be  the  parameter  vector  of  the  actor  at  time  k,  and 
Xfc_|_i  be  the  new  state,  obtained  after  action  Uk  is  applied  when  the  state  is  x^.  A  new  action  Uk+ 1 
is  generated  according  to  the  RSP  corresponding  to  the  actor  parameter  6k  (see  [25]).  The  critic 
and  the  actor  carry  out  the  following  updates  where  ccfc  is  an  estimate  of  the  average  cost,  z G  Rn 
is  often  called  Sutton’s  eligibility  trace  [39]  which  is  a  weighted  sum  of  the  present  and  past  feature 
vectors  obtained  from  the  simulations,  G  Rn  represents  the  eligibility  trace  scaled  by  the  drifts 
of  single  period  reward  from  the  average  cost,  and  A i~  G  Rnxn  is  a  sample  estimate  of  the  matrix 
formed  by  zk(i/>'dk(xk+1  ,uk+1)  -  0^ (xfc, uk)). 


LSTD  Actor-Critic 


Initialization: 


Set  all  entries  in  zo,ao,Ao,bo  and  ro  to  zeros.  Let  do  take  some  initial  value,  potentially 
corresponding  to  a  heuristic  policy. 

Critic: 

1  —  C^k  T  'Ik  [fi^Xfc,  U k )  ®fc]  j 

zfc+ 1  “1“  (xfci  ^fc)) 

bfc+i  =  bfc  +  7fc  [(g(xfc,  Ufe)  -  afc)zfc  -  bfe] , 

Afc+i  =  Ak  +  [zfe(V>efc(xfc+i,  Wfc+i)  -^ek(Ak,Uk))  ~  Ak]  , 

where  A  6  [0, 1),  7fc  =  7  corresponds  to  the  original  LSTD,  and  finally 
k 

rk+i  =  -A^1bfc.  (8) 


Actor: 

Qk+ 1  =  ^k  ~  /3fcr(rfc)V>efc(xfc+1,ufc+i)rfc^0fc(xfc+i,ufc+i).  (9) 

In  the  above,  {7*,}  controls  the  critic  step-size,  while  {/3k}  and  T(r)  control  the  actor  step-size 
together.  An  implementation  of  this  algorithm  needs  to  make  these  choices.  The  role  of  T(r)  is 
mainly  to  keep  the  actor  updates  bounded,  and  we  can  for  instance  use 


T(r) 


if  | |r 1 1  >  D, 
otherwise, 


for  some  D  >  0.  {/3k}  is  a  deterministic  and  non-increasing  sequence  for  which  we  need  to  have 


X  fa  =  00, 

k 


X^  <  °°> 


lim  —  =  0, 

fc^OO  7fc 


An  example  of  {/ 3k }  satisfying  Eq.  (10)  is 


Pk 


c 

k  In  k ’ 


k  >  1, 


(10) 


where  c  >  0  is  a  constant  parameter. 

Note  that  the  critic  update  above  is  essentially  the  standard  LSTD  [42],  and  the  actor  update 
is  the  same  as  that  of  [25].  However,  the  nontrivial  task  here  is  to  show  that  the  combination 
works.  Specifically,  we  need  to  show  that  the  LSTD  critic  also  converges  despite  policy  updates  of 
the  actor,  and  consequently  the  actor  converges  as  well.  The  conclusion  is  stated  in  the  following 
theorem,  while  the  proof  is  given  in  Appendix  A. 


Theorem  3.1  [ Convergence ]  For  the  LSTD  actor-critic  with  some  step-size  sequence  {Pk}  satisfy¬ 
ing  (10)  and  for  any  e  >  0,  there  exists  some  A  sufficiently  close  to  1  such  that  lim  inf k  ||  Va(0fc)||  <  e 
w.p.l.  That  is,  6k  visits  an  arbitrary  neighborhood  of  a  stationary  point  infinitely  often. 


4  Actor-Critic  algorithms  for  POMDPs  with  indirectly  observed 
cost 

A  Partially  Observable  Markov  Decision  Process  (POMDP)  is  a  generalization  of  an  MDP  in  which 
the  system  dynamics  follow  an  MDP  but  the  agent  cannot  directly  observe  the  underlying  state. 
Instead,  it  may  maintain  a  probability  distribution  over  the  set  of  possible  states,  based  on  a  set  of 
observations.  The  POMDP  framework  is  general  enough  to  model  a  variety  of  real-world  sequential 
decision  processes  (for  a  review,  see  [16]). 

It  was  shown  in  [40]  that  if  the  cost  of  a  POMDP  is  perfectly  observable,  then  the  analytical 
results  on  TD  and  actor-critic  algorithms  can  be  transferred  without  difficulty.  In  this  section, 
the  application  of  the  LSTD  actor-critic  algorithm  is  extended  to  the  case  of  a  POMDP  where,  in 
addition  to  the  state  being  partially  observable,  the  critic  observes  the  cost  only  through  its  (noisy) 
observations  of  the  state.  To  the  best  of  our  knowledge,  there  is  no  previous  work  in  the  literature 
that  considers  such  a  POMDP  model  in  an  actor-critic  framework.  There  can  be  a  number  of 
reasons  why  the  cost  is  not  directly  observable.  For  example,  the  wear  of  an  equipment  can  be 
precisely  known  only  at  the  time  of  maintenance,  not  at  the  time  of  operation.  However,  one  may 
have  a  less  precise  but  more  timely  estimate  of  this  cost  based  on  the  usage  of  the  equipment.  In  this 
section,  and  under  reasonable  assumptions,  the  actor-critic  algorithms  are  modified  to  overcome  the 
imperfect  knowledge  of  costs.  An  example  scenario  is  the  warehouse  vehicle  dispatching  problem 
we  tackle  later. 

Consider  the  formulation  introduced  in  Section  2  and  suppose  the  state  x  has  a  bijective  rep¬ 
resentation  (xW,  x(2), . . . , x(M))  where  x^1)  G  XW,x®  G  X^2), . . . ,  x^M)  G  X.(M\  Of  course,  the 
non-trivial  case  M  >  1  is  of  interest,  especially  where  the  size  of  the  problem  grows  mostly  due 
to  M,  while  the  cardinality  of  each  individual  X^i  remains  moderate.  For  the  observation  model, 
to  which  we  will  also  be  referring  to  as  the  noise  model ,  assume  that  an  observation  y  is  gener¬ 
ated  from  the  true  state  x  according  to  some  known  probability  distribution  p(y|x)  represented  as 
a  matrix  Py|x  (with  rows  and  columns  indexed  by  the  possible  values  of  x  and  y,  respectively). 
It  is  further  assumed  that  the  observations  belong  to  the  state  space,  i.e.,  y^  is  represented  by 
(y(1),y(2),  •  •  • ,  y^Mi)  where  y^  G  X®,  1  =  1,...,  M.  Note  that  the  partial  observability  here  is  in 
the  sense  that  observations  are  subject  to  noise  and  errors.  Furthermore,  it  is  assumed  that  the 
cost  is  separable  as  indicated  in  the  following  assumption. 

Assumption  A 

(i)  (Separability)  The  cost  function  can  be  written  as  g(x,u)  =  g\ (x^1))  +  g2(x^)  +  •  •  •  + 
Sm(x(M))  +  gu{u). 

(ii)  The  noise  model  can  also  be  expressed  as  M  independent  distributions  represented  by  full-rank 

matrices  Py(oix(i),  l  =  1  ,M. 

Note  that  it  is  realistic  to  assume  the  invertibility  of  Py(i)ix(0,  because  in  most  applications, 
the  quality  of  state  observations  is  reasonably  good,  hence,  Py(i)ix(0  is  close  to  the  identity  matrix. 
The  RSP  in  the  POMDP  case  is  chosen  to  be  a  function  of  the  observation;  that  is,  go(uk\ykf- 
This  form  of  RSP  can  be  accommodated  by  augmenting  the  state  to  {x^y*.}.  As  shown  in  [40], 
the  actor-critic  algorithm  still  converges  to  the  optimal  policy  if  the  cost  is  perfectly  observed. 
For  the  case  where  the  cost  is  not  perfectly  observable,  we  introduce  the  following  LSTD  Partially 
Observable  (LSTD-PO)  actor-critic  algorithm. 


LSTD-PO  Actor-Critic  for  POMDP  with  Indirectly  Observed  Cost 
Initialization: 


Set  zo,aO)  Ao,bo  and  ro  to  zeros.  Let  6 o  take  some  initial  value,  potentially  corresponding  to 
a  heuristic  policy. 


Critic: 

1  —  C^k  T  7fc  [9i.ykj  ^fc)  j 
'Z,k+ 1  —  Az&  T  Ip 0k(y ki  Hk') i 
bfc+r  =  bA.  +  7fc  [(g( yk,  uk)  -  ak)zk  -  bfc] , 

Afc+i  =  Ak  +  7fc  [zk('ip'gk(yk+i,uk+1)  -  'ip'0k(yk,uk))  -  Ak]  , 
rk+i  =  -AAT1bfc. 


In  the  above,  A  G  [0, 1),  'yk 


A  1 

“  V 


and 


(11) 


M 


g(  y,u)  = 


,(0 


/  ^  /(x(0),/(y( 0) 

1=1  x(0 


ffi(x(i))  +  5t/(«) 


(12) 


where  /(•)  is  a  function  that  returns  the  index  of  a  given  element  in  the  component  space  X® 
(consistently  with  the  ordering  of  the  rows  and  columns  of  Py(0|x(o)>  and  vfj  is  the  element  located 
on  the  ith  row  and  jth  column  of  P~ (o  |x(«)  -  F°r  convenience,  from  now  on  the  notation  y(i) 

will  be  used  for  Iry(i)y  an(l  9  will  be  called  the  cost  proxy.  As  will  be  shown,  this  cost  proxy 

averages  to  the  same  value  as  the  true  cost. 

Actor: 

ek+1  =  ek-  Pk^^k),tPek(yk+^uk+iyk,tPok(yk+uuk+1).  (13) 

The  step  size  sequences  follow  the  same  rules  as  in  the  previous  section.  Note  that  the  only  difference 
here  is  the  introduction  of  the  cost  proxy. 


Theorem  4.1  For  any  policy  parameter  6,  the  average  true  cost  ag  equals  the  average  cost  proxy 

=  ^2rie(y,u)g(y,u), 

y  ,u 

where  f)g(y.  u)  is  the  stationary  distribution  of  the  process  {yk,uk}- 
See  Appendix  B  for  the  proof. 

Remarks  : 


1.  We  should  clarify  that  the  average  true  cost  ag  is  the  average  cost  resulting  from  the  RSP 
pg(u\y)  applied  to  the  “noisy”  version  of  the  state  and  is  not  equal  to  the  average  cost  of  an 
RSP  /ig(u |x)  that  has  access  to  the  true  state.  Clearly  the  optimal  (over  9 )  former  RSP  is 
suboptimal  to  the  optimal  latter  RSP. 

2.  We  also  note  that  the  proposed  LSTD-PO  actor-critic  algorithm  optimizes  the  average  cost 
proxy  ag  in  the  same  way  the  original  LSTD  actor-critic  algorithm  optimizes  the  average  true 
cost.  Then,  since  according  to  the  above  theorem  these  costs  coincide  for  every  6 ,  it  follows 
that  the  LSTD-PO  actor-critic  optimizes  the  average  true  cost. 


3.  One  possible  technical  concern  here  is  whether  the  core  iteration  of  the  LSTD-PO  critic 
(defined  by  (11))  converges  like  that  of  the  normal  LSTD  critic  (defined  by  (7))  does.  To 
address  this  concern,  note  that  (11)  differs  from  (7)  in  exactly  two  ways:  first,  (11)  uses  the 
sample  path  of  y  instead  of  x;  and  second,  (11)  substituted  the  true  cost  (a)  with  the  cost 
proxy  (a).  The  first  difference  could  affect  the  convergence  of  the  algorithm  if  the  y-based 
process  has  a  different  ergodicity  property.  But  that  is  not  possible  because  y  is  generated 
by  x  with  a  full-rank  probability  transition  matrix  —  consequently  the  cr-algebra  of  y  is 
topologically  equivalent  to  that  of  x,  and  the  distribution  of  y  stabilizes  whenever  that  of  x 
stabilizes.  On  the  other  hand,  the  second  difference  turns  out  not  really  a  difference,  because 
Theorem  4.1  shows  a  =  a  for  any  value  of  6.  In  conclusion,  we  do  not  need  to  give  a  separate 
proof  for  the  convergence  of  the  LSTD-PO  actor-critic. 

It  should  be  emphasized  that  the  primary  role  of  separability  (Assumption  A(i))  is  to  allow 
the  LSTD-PO  algorithm  be  implemented  with  mild  computation  cost.  If  the  state  space  is  not 
separable,  then  a  huge  matrix  Py|x  has  to  be  inverted  in  lieu  of  M  smaller  matrices  Py(i)ix(0-  Even 
in  the  case  that  one  of  the  latter  matrices  is  relatively  large,  it  will  typically  be  sparse,  which  makes 
it  appropriate  to  store  and  compute  its  inverse  using  LU  factorization.  Very  briefly,  LU  factorization 
is  a  known  method  to  solve  a  general  system  of  linear  equations  of  the  form  Ax  =  b.  It  factors  a 
non-singular  matrix  A  €  Mnxn  as  A  =  PLU  where  P  €  Mnxn  is  a  permutation  matrix,  L  €  Mnxn 
is  unit  lower  triangular,  and  U  €  Mnxn  is  upper  triangular.  The  standard  algorithm  for  computing 
an  LU  factorization  is  Gaussian  elimination  with  partial  pivoting  or  Gaussian  elimination  with  row 
pivoting  ([14]).  The  cost  is  flops  if  no  structure  in  A  is  exploited  where  a  flop  is  defined  as  one 
addition,  or  subtraction,  or  multiplication,  or  division  of  two  floating-point  numbers.  To  compute 
A^1,  we  can  solve  the  equations  Ax,;  =  e.;,  where  x;  is  the  ith  column  of  A-1,  and  e;  is  the  ith  unit 
vector,  resulting  in  a  total  cost  of  flops.  However,  if  A  is  sparse,  computing  the  LU  factorization 
is  less  costly  and  in  many  cases  the  cost  grows  approximately  linearly  with  n  when  n  is  large. 

Also  note  that  LSTD-PO  requires  a  single  column  from  each  inverse  matrix  of  the  noise  model 
at  each  iteration  (the  columns  corresponding  to  the  observed  state),  and  these  could  be  computed 
using  the  technique  above.  Thus,  not  only  can  the  computational  cost  be  dissipated  over  the 
lifetime  of  the  learning  algorithm,  but  some  columns  of  the  inverse  might  not  be  computed  at  all 
if  the  corresponding  states  are  never  visited.  As  a  final  note  on  computational  matters,  efficient 
techniques  for  storing  the  sparse  matrix  can  be  found  in  [19,  36]  and  [3]. 


5  Forklift  dispatching  in  warehouse  management 

In  this  section,  we  demonstrate  how  the  proposed  LSTD  actor-critic  can  be  used  effectively  in 
dealing  with  a  real-world  problem  (forklift  dispatching  in  warehouses).  Our  model  of  the  warehouse 
described  in  this  section  is  based  on  an  on-site  visit  and  discussion  with  the  forklift  manufacturer  and 
the  warehouse  managers.  Simulation  is  used  to  evaluate  the  efficiency  of  the  proposed  algorithm. 

5.1  Setup 

Consider  a  warehouse  with  multiple  items  being  supplied  and  demanded  dynamically.  The  items 
come  in  and  out  of  the  warehouse  through  a  central  depot.  After  coming  in,  they  are  first  stored 


Figure  1:  Schematic  top- view  of  the  warehouse  and  an  illustration  of  Task  1. 


at  some  vertically  stacked  reserve  locations,  and  then  moved  to  pick-up  locations  that  are  on  the 
ground  level  and  are  specific  for  each  item.  Fig.  1  and  Fig.  2  show  a  schematic  top-view  of  the 
warehouse  and  a  side-view  of  an  isle  in  the  warehouse,  respectively. 

Three  types  of  tasks  are  performed  in  the  warehouse: 

•  Task  1:  Moving  unloaded  items  from  the  depot  to  the  reserve  locations. 

•  Task  2:  Moving  items  from  the  reserve  locations  to  fill  the  corresponding  pick-up  locations. 

•  Task  3:  Moving  demanded  items  from  the  pick-up  locations  to  the  depot. 

The  forklifts  perform  Tasks  1  and  2,  and  Task  3  is  done  by  a  different  type  of  vehicle  (called 
pallet-truck).  Any  item  in  the  warehouse  must  go  through  a  complete  cycle  of  tasks  in  the  above 
order  (i.e. ,  Task  1  — *  Task  2  — >  Task  3).  Figs.  1,  2  and  3  illustrate  Tasks  1  and  2  and  an  overview 
of  all  tasks  in  the  warehouse,  respectively. 

The  operation  of  pallet  trucks  is  simple,  rigid,  and  with  no  substantial  opportunity  for  optimiza¬ 
tion.  Only  the  dispatching  of  the  forklifts  is  considered  here,  with  the  goal  of  moving  items  through 
the  cycle  as  efficiently  as  possible.  In  addition  to  Tasks  1  and  2  described  above,  the  forklift  may 
also  be  commanded  to  go  to  the  battery /maintenance  shop  or  simply  idle.  Intuitively,  the  factors 
that  need  to  be  considered  include  the  urgency  of  the  demand  for  specific  items,  the  urgency  of 
unloading  the  supplied  items  and  freeing  the  suppliers’  trucks,  traffic  congestion  experienced  by  the 
forklifts,  and  the  “health”  of  the  forklifts  (e.g.,  battery  status,  or  other  indications  from  a  machine 
condition  monitoring  system).  However,  with  a  large  number  of  items  that  have  different  values, 
demand  patterns  and  other  contextual  variables,  forklift  dispatching  is  a  high- dimensional  dynamic 
decision  problem. 

The  problem  is  naturally  formulated  as  an  average-cost  infinite-horizon  MDP.  Let  N  be  the 
number  of  items,  I  be  the  number  of  isles,  and  F  be  the  number  of  forklifts  in  the  warehouse.  The 
state  of  the  system  includes: 


♦  II  * 


Figure  2:  Schematic  side-view  of  an  isle  in  the  warehouse  and  an  illustration  of  Task  2. 


•  pp  the  level  of  item  i  in  its  pick-up  location  (i  =  1, . . . ,  N), 

•  dj\  the  level  of  items  associated  with  isle  j  waiting  in  the  depot  (j  =  1 ,...,/), 

•  1 1-:  the  location  of  forklift  k,  (k  =  1, . . . ,  F), 

•  sj.:  the  status  of  forklift  k,  (k  =  1, . . . ,  F), 

•  hk-  health  of  forklift  k,  (k  =  1, . . . ,  F). 

The  capacities  of  pick  up  locations  and  of  the  depot  are  assumed  finite,  each  pi  takes  values 
from  a  discrete  set  {— P,  —P  +  1, . . . ,  —1, 0, 1, . . . ,  P  —  1,  P},  and  each  dj  takes  values  from  a  discrete 
set  {0,1  —  1,Q}.  Specifically,  pi  <  0  indicates  that  \pi\  items  are  “backordered”  at  pick-up 

location  i.  The  warehouse  area  is  partitioned  such  that  at  each  time,  each  forklift  is  assumed  to 
be  located  either  next  to  one  of  the  pick-up  locations  (identified  by  1  or  at  the  depot 

(identified  by  N  +  1),  or  at  the  battery/maintenance  shop  (identified  by  N  +  2),  hence  4  takes 
values  from  a  discrete  set  {1,...,JV,  N  +  1,1V  +  2}.  Each  Sk  takes  values  from  a  discrete  set 
{“Idle”,  “Battery/Maintenance”,  “ Task  1”,  “Task  2”},  where  “Idle”  denotes  that  the  forklift  is 
idle,  “Battery/Maintenance”  denotes  that  the  forklift  is  in  maintenance  or  its  battery  is  being 
replaced,  and  “Task  1”  or  “Task  2”  denote  that  the  forklift  is  performing  either  Task  1  or  Task 
2,  respectively.  Each  hk  also  takes  values  from  a  discrete  set  {0, 1, ... ,  B},  where  0  indicates  that 
the  forklift’s  battery /health  is  in  “bad”  condition,  and  the  higher  the  value  of  %  is,  the  better  the 
health  of  the  forklift  k  is.  The  state  of  the  system,  therefore,  can  be  represented  by  the  following 
vector  of  size  N  +  I  +  3 F : 

{pi,  ■  ■  ■  ,Pn,  d\, . . .  ,di,h, . . .  . . . ,  sf ,  /if}, 

Considering  that  a  good  number  of  variables  are  involved  in  describing  the  problem,  let’s  summarize: 
F  is  the  number  of  forklifts;  the  status  of  each  forklift  takes  4  possible  values;  I  is  the  number  of 


Reserve 

locations 


Figure  3:  Overview  of  the  activities  in  the  warehouse. 


isles;  N  is  the  number  of  pickup  locations;  B  +  1  is  the  number  of  forklift  health  levels;  2 P  +  1  is 
the  number  of  stock  levels  in  each  pickup  location;  and  Q  +  1  is  the  number  of  stock  levels  of  the 
depot.  Then,  the  state  space  of  the  system  has 

(2  P  +  1  )n(Q  +  1  Y{N  +  2)f4  f{B  +  1)F 


discrete  elements. 

The  arrival  processes  of  deliveries  and  demands  are  assumed  to  follow  the  Poisson  distribution, 
whereas  the  quantities  of  the  deliveries  and  demands  are  assumed  to  follow  a  discrete  uniform 
distribution.  No  deliveries  are  allowed  when  the  depot  is  full,  and  no  further  demand  of  an  item  is 
accepted  when  the  current  level  of  the  item  in  its  pick-up  location  is  at  the  lowest  possible  value 
(i.e. ,  the  maximum  backordered  level).  Traffic  congestion  in  the  ith  isle,  indicated  by  K{  is  defined 
as 

Ki  =  IffiX,  +  W2Fil 

where  Ft  is  the  number  of  forklifts  operating  in  that  isle  (known  since  the  position  of  the  forklifts 
is  part  of  the  state),  Xj  is  an  estimate  of  the  pallet  truck  traffic,  defined  as 

Xi  =  log  (~Pj), 

O'lIND (j)=i,  Pj< 0} 

and  W i  and  W2  are  constant  weights  capturing  the  relative  degree  according  to  which  pallet  trucks 
and  forklifts  contribute  to  the  traffic,  respectively.  In  the  above,  IND(j)  is  a  function  that  returns 
the  index  of  the  isle  in  which  the  pick-up  location  of  the  jth  item  is  located. 

Tasks  1  and  2  are  assumed  to  have  exponentially  distributed  durations  with  rates  that  can 
depend  on  the  type  of  the  task.  Given  the  current  state  and  the  particular  forklift  (with  a  known 
location)  to  which  either  Task  1  or  2  gets  assigned,  the  expected  task  duration  is  a  certain  (increas¬ 
ing)  function  of  the  traffic  congestion  present  at  the  destination  isle.  We  assume  that  this  function 
takes  values  in  a  finite  interval,  hence,  the  rate  at  which  a  task  gets  completed  is  finite  for  all  states. 


Assume  also  that  when  a  forklift  is  ordered  to  remain  “Idle”  or  go  to  the  “Battery/Maintenance” 
it  remains  in  the  corresponding  status  for  an  exponentially  distributed  amount  of  time  with  a  rate 
that  may  depend  on  the  status.  It  follows,  that  a  forklift  k  maintains  each  status  Sk  for  an  expo¬ 
nentially  distributed  amount  of  time.  Let  pm ax  the  maximum  rate  (over  all  states)  corresponding 
to  this  exponential  distribution. 

With  the  assumptions  put  in  place  the  state  of  the  system  evolves  as  a  continuous  time  Markov 
chain.  We  uniformize  this  Markov  chain  by  creating  a  Poisson  clock  with  rate  F  pmax  at  which 
transitions  occur.  Given  the  rates  of  various  status  durations,  it  is  straightforward  to  compute 
transition  probabilities  to  the  next  state  each  time  our  uniform  Poisson  clock  ticks;  clearly,  some  of 
these  transitions  are  self-transitions.  We  will  use  our  MDP  machinery  to  make  (forklift  assignment) 
decisions  only  at  the  transition  epochs  of  this  uniformized  Markov  chain. 

The  one-step  cost  g(-)  includes  the  opportunity  cost  of  having  shortages  in  the  pickup  locations 
(<7i(-)),  the  cost  associated  with  the  delay  of  clearing  the  depot  ( <72 ( •)),  and  the  cost  of  operating 
the  forklifts  ( ^3  ( • ) ) ,  where 

N  I 

gi  (x,  u)  =  y^Qmax(^,0),  y2(x,u)  =  Cs^di,  g3(x,it)  =  cfnf. 

i= 1  i=l 

In  the  above,  Ci  is  the  price/value  of  item  i  times  the  chance  of  having  a  sale  per  unit  of  time  if  the 
pickup  location  is  stocked  at  or  above  demand;  cs  is  the  average  cost  of  delay  in  clearing  the  depot 
per  pallet  per  unit  of  time;  Cf  denotes  the  average  cost  of  operating  a  forklift  per  unit  of  time;  and 
rif  denotes  the  number  of  forklifts  that  are  being  operated  (i.e. ,  with  a  status  equal  to  “Task  1”  or 
“Task  2”).  The  one-step  cost  rate  g(-)  is  then  defined  as 

S(x,«)  =  Si(x,«)  +£2(x,u)  +  g3(x.,u). 


As  we  mentioned  in  the  Introduction,  the  forklifts  are  fitted  with  wireless  sensors  and  communi¬ 
cate  with  the  dispatching  system  via  a  warehouse  wireless  sensor  network  (see  [30]  for  an  overview 
of  the  sensor  network  setup).  The  sensors  on  each  forklift  provide  the  forklift’s  location  (needed  in 
order  to  determine  congestion  at  the  isles)  as  well  as  various  forklift  health  indicators  (e.g.,  battery 
status).  These  observations  are  “noisy”  due  to  estimation  errors  but  they  fit  our  POMDP  frame¬ 
work.  To  see  this,  we  show  that  the  separability  assumption  of  the  cost  function  (cf.  Assumption 
A)  holds.  We  can  represent  the  bijective  representation  of  the  state  as 


jW,...,/) 

X(N+ 1)  _  _  _  X{N+I) 

X{N+I+1)  _  _  _  X(N+I+F) 
X(N+I+F+1)  _  _  _  X{N+I+2F) 
X(N+I+2F+1)  _  _  ^  X(N+I+3F) 


=  P 1,  •  •  •  ,PN, 
=  di, . . . ,  di, 

1 1  ,  .  .  .  ,  l]7  , 
<Sl  ,  ?  S] 7  , 

=  hi, . . . ,  Iif 


and  we  can  write 


5(x,u)  =  gi(xm)  +  g2(x(-2))  H - \-  gN+I+3F(x{N+I+3F))  +  gu(u), 


where 


9i(x W) 


Ci  max (/^,  0),  if  1  <  i  <  N, 

<  csdi- isr,  iiN-\-l<i<N  +  I, 

0,  otherwise, 


and 


gu(u)  =  Cfh(u). 

It  follows  that  Assumption  A(i)  holds.  Furthermore,  because  the  processes  of  location  detection 
and  of  the  battery /health  readings  from  each  forklift  are  all  independent  of  each  other,  Assumption 
A(ii)  also  holds  and  the  noise  models  can  be  constructed  (e.g.,  using  sampling  methods).  Therefore, 
the  result  of  Theorem  4.1  is  applicable. 


5.2  The  RSP 


The  action  space  U  contains  N  +  I  +  2  elements: 

1.  the  first  N  elements  correspond  to  Task  2  assignments,  where  each  element  maps  to  one  of 
the  N  pick-up  locations; 

2.  the  next  I  elements  concern  Task  1  assignments,  where  element  N+i,  i  =  1, . . . ,  I,  corresponds 
to  moving  items  from  the  depot  to  a  reserve  location  in  isle  i; 

3.  the  ( N  +  I  +  l)st  element  corresponds  to  “going  to  the  battery/maintenance  shop”;  and, 
finally, 

4.  the  (N  +  I  +  2)nd  element  corresponds  to  “idling”. 

Our  RSP  is  constructed  by  exponential  functions  with  a  4-dimensional  parameter  vector 

0  =  (^1, 02,  03,  &0- 

The  scalar  elements  of  6  are  used  to  weigh,  respectively,  the  urgency  of  the  demand  (0i),  the  delay 
at  the  depot  (02),  traffic  congestion  ($3),  and  forklift  health  ($4).  The  RSP  is  given  by 


where 


Me(«|x) 


1 


(ES/+2 


ai) 


(  ai  N 

V  aN+I+2  / 


P  e01Ci(f/-Pi)+03 Kj^) 

Di_ N  e^di-N+03Kr-N  y 

Be04, 


1  <i<N, 

N  +1  <i  <  N  +  I, 
i  =  N +  1  +  1, 
i  =  N  +  I  +  2, 


in  which  U  is  the  maximum  capacity  of  pick-up  locations,  and 


Pi  = 

Di  = 
B  = 


if  the  pick-up  location  of  the  zth  item  is  full, 
otherwise, 

if  there  is  no  item  associated  with  the  ith  isle  waiting  in  the  depot, 
otherwise, 

if  the  health  condition  of  the  forklift  is  “bad” , 
otherwise. 


(14) 


Note  that,  we  are  optimizing  each  forklift  as  if  it  were  a  single,  independent  agent.  In  our 
uniformized  Markov  chain  at  most  one  forklift  completes  its  current  status  at  any  given  epoch; 
when  this  happens  we  use  the  RSP  described  above  to  assign  a  new  status  to  it. 

Some  observations  on  the  structure  of  the  RSP  are  in  order.  The  probability  of  filling  pick¬ 
up  location  i  increases  with  the  value  c*  of  item  i  and  the  backlog  U  —  pi  but  it  decreases  with 
the  congestion  level  at  the  corresponding  isle  I (j ) .  Note  that  this  probability  is  controlled  by 
parameters  9\  and  63.  Similarly,  the  probability  of  assigning  a  forklift  to  move  material  from  the 
depot  to  a  reserve  location  in  isle  j  is  increasing  with  the  amount  of  such  material  accumulated  at 
the  depot  (dj)  and  decreasing  with  the  congestion  in  isle  j.  The  latter  probability  is  controlled  by 
parameters  62  and  63.  Finally,  the  probability  of  sending  the  forklift  to  the  maintenance/battery 
shop  is  controlled  by  the  parameter  64.  The  task  of  the  actor-critic  algorithm  is  to  optimize  the 
policy  performance  over  6  =  (Qi, ...  ,64). 

5.3  Numerical  results 

We  first  obtained  the  exact  optimal  policy  for  instances  of  a  very  small-scale  version  of  this  problem 
which  involves  2  forklifts  (F  =  2),  2  isles  (7  =  2),  and  4  items  (N  =  4)  using  standard  Dynamic 
Programming  (DP)  -  value  iteration  in  particular.  We  set  P  =  2,  Q  =  lj  and  B  =  1.  We 
generated  12  instances  of  this  small-scale  version  (termed  S-l  up  to  S-12)  by  varying  the  values 
of  the  parameters  in  the  cost  function  as  well  as  the  rates  of  arrivals  of  deliveries  and  demands. 
We  then  used  our  LSTD  actor-critic  algorithm  to  optimize  the  RSP  outlined  earlier.  The  results 
are  shown  in  Table  1  (AC  is  used  as  abbreviation  for  actor-critic  algorithm).  In  addition,  Fig.  5 
illustrates  the  progress  of  the  algorithm  for  the  instance  S-l.  For  the  small-scale  problem,  our  actor- 
critic  algorithm  took  a  much  shorter  time  than  DP  and  consistently  found  near-optimal  solutions 
along  different  sample  paths  (within  10%  of  the  optimal  DP  cost).  We  also  tracked  the  changes 
in  the  average  congestion  (Fig.  6),  average  availability  of  items  to  fulfill  demand  (Fig.  7),  average 
number  of  items  waiting  at  the  depot  (Fig.  8),  and  average  delay  at  the  depot  (Fig.  9).  The  purpose 
is  to  qualitatively  understand  how  the  policy  we  obtained  trades-off  these  factors,  which  is  useful 
in  designing  a  good  heuristic  policy.  For  instance,  we  found  from  the  DP  solution  that  Task  2  has 
higher  priority  than  Task  1. 

We  then  used  this  insight  to  design  a  heuristic  policy  that  we  used  as  a  benchmark  for  larger-scale 
problems  where  DP  becomes  computational  intractable.  The  description  of  the  heuristic  is  given 
in  Fig.  4.  It  is  interesting  that  it  is  consistent  with  state-of-the-art  policies  used  by  practitioners  in 
actual  warehouses.  When  applied  to  the  small-scale  problem  above  it  performs  within  20%  of  the 
optimal  cost. 

In  order  to  evaluate  the  efficiency  of  our  approach,  we  used  it  to  learn  a  locally  optimal  RSP  for 
a  larger-scale  problem  involving  10  forklifts,  4  isles  and  32  items.  Note  that  with  only  2  forklifts, 
2  isles  and  4  items,  the  state  space  contains  5,760,000  discrete  elements.  The  size  of  the  state 
space  for  the  larger-scale  problem  is  much  greater  (8.26e  +  47  discrete  elements),  hence,  it  cannot 
be  solved  with  exact  DP  in  a  reasonable  amount  of  time. 

We  generated  10  instances  of  the  larger-scale  version  (termed  L-l  up  to  L-10)  by  varying  the 
values  of  the  parameters  in  the  cost  function  as  well  as  the  parameters  of  the  distribution  functions 
characterizing  arrivals  of  deliveries  and  demands.  We  then  tested  our  LSTD  actor-critic  algorithm 
to  see  if  it  is  able  to  learn  a  better  policy  than  the  heuristic  policy  of  Fig.  4.  The  results  are 


As  soon  as  a  forklift  finishes  its  task,  assign  a  new  task  to  it  according  to  the  following  order  of 
priorities: 

1.  If  the  forklift’s  health  is  “bad”,  then  the  forklift  is  commanded  to  go  to  the  bat¬ 
tery/maintenance  shop. 

2.  If  there  are  shortages  in  the  pick-up  locations,  then  Task  2  is  assigned.  When  more  than  one 
items  are  in  shortage,  the  item  with  the  highest  unit  price  is  served  first. 

3.  If  there  are  items  waiting  at  the  depot,  then  Task  1  is  assigned. 

4.  Otherwise,  the  forklift  is  instructed  to  idle. 


Figure  4:  The  description  of  the  heuristic  policy  for  the  forklift  dispatching  problem. 


Table  1:  Results  obtained  by  the  algorithms  for  small-scale  problems. 


Instance 

S-l 

S-2 

S-3 

S-4 

S-5 

S-6 

S-7 

S-8 

S-9 

S-10 

S-ll 

S-12 

Optimal  cost  (DP) 

5.44 

12.97 

7.43 

6.32 

14.65 

9.26 

5.82 

14.02 

9.07 

8.18 

15.14 

9.91 

AC  cost 

5.95 

13.23 

7.95 

6.51 

15.97 

10.02 

5.98 

14.86 

9.43 

8.75 

16.53 

10.48 

AC  cost 

Optimal  cost 

1.09 

1.02 

1.07 

1.03 

1.09 

1.08 

1.03 

1.06 

1.04 

1.07 

1.09 

1.06 

Heuristic  cost 

6.52 

15.05 

9.36 

7.27 

17.29 

10.93 

6.93 

16.82 

11.16 

9.73 

17.71 

11.69 

Heuristic  cost 
Optimal  cost 

1.20 

1.16 

1.26 

1.15 

1.18 

1.18 

1.19 

1.20 

1.23 

1.19 

1.17 

1.18 

encouraging  (see  Table  2).  In  addition,  Fig.  10  illustrates  the  progress  of  the  algorithm  for  the 
instance  L-l.  The  LSTD  actor-critic  converged  to  policies  that  represent  more  than  a  15%  cost 
reduction  compared  to  the  heuristic  policy.  In  some  instances,  this  improvement  exceeds  25%.  In 
particular,  we  noticed  that  the  actor-critic  algorithm  did  a  good  job  avoiding  traffic  congestion. 

In  addition,  Fig.  5  and  Fig.  10  also  compare  the  performance  of  our  proposed  LSTD  actor-critic 
with  the  traditional  TD-based  actor-critic  developed  in  [25]  (with  A  in  the  TD(A)  critic  set  to  0.75) 
and  the  algorithm  in  [32]  (natural  actor-critic).  Note  that  our  LSTD  actor-critic  shows  smoother 
and  faster  convergence  behavior  than  both  alternatives. 

We  also  used  the  LSTD-PO  actor-critic  to  learn  the  optimal  policy  under  partial  observability 
of  states  and  cost.  As  shown  in  Fig.  5  and  Fig.  10,  the  simulation  results  confirm  Theorem  4.1. 
We  mention  that  the  LSTD-PO  algorithm  converges  to  10.63  for  the  S-l  instance  which,  as  shown 
in  Fig.  5,  is  much  higher  than  the  cost  achieved  under  perfect  state  observations.  To  compare  the 
former  with  a  more  precise  baseline,  a  lower  bound  of  the  optimal  average  cost  for  the  imperfect 
observation  case  was  computed  using  the  methods  given  in  [41].  To  obtain  this  lower-bound,  we 
discretized  the  belief  space  of  the  POMDP  into  a  finite  number  of  belief  points  and  computed  the 


Figure  5:  Results  obtained  by  the  algorithms  for  the  S-l  instance. 


Table  2:  Results  obtained  by  the  algorithms  for  larger-scale  problems. 


Instance 

L-l 

L-2 

L-3 

L-4 

L-5 

L-6 

L-7 

L-8 

L-9 

L-10 

Heuristic  cost 

18.07 

26.11 

22.44 

31.72 

27.32 

38.87 

31.06 

40.43 

33.38 

41.82 

AC  cost 

14.11 

21.41 

16.61 

26.64 

21.58 

31.48 

24.85 

33.15 

25.37 

34.29 

Heuristic  cost  -  ACJ  cost 

Heuristic  cost 

0.22 

0.18 

0.26 

0.16 

0.21 

0.19 

0.20 

0.18 

0.24 

0.18 

approximating  functions  at  these  points  using  multi-chain  algorithms.  We  discretized  the  belief 
space  (which  is  a  simplex  in  an  |X|-dimensional  space)  into  a  regular  grid  according  to  pattern  k-E 
introduced  in  Section  4  of  [41] .  Briefly,  the  pattern  k-E  consists  of  k  grid  points  on  each  edge,  in 
addition  to  the  vertices  of  the  belief  simplex.  We  then  used  the  pattern  10  —  E  and  obtained  the 
value  9.7  for  the  lower-bound  on  the  optimal  average  cost  which  implies  that  LSTD-PO  finds  near- 
optimal  solutions.  Finally,  we  mention  that  the  dimensionality  of  the  belief  space  for  the  larger-scale 
problem  is  too  large,  hence,  it  is  impractical  to  obtain  such  a  lower-bound  in  a  reasonable  amount 
of  time. 

The  numerical  experiments  above  were  implemented  in  C  on  a  PC  with  1.4  GHz  CPU  speed. 
On  average,  the  exact  DP  for  the  small-scale  instances  took  about  10  hours  to  converge  whereas  the 
actor-critic  converged  within  20  minutes.  For  the  larger-scale  instances,  the  actor-critic  algorithm 
took  about  2  hours  to  converge. 
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Figure  6:  Congestion  data  obtained  by  the  proposed  actor-critic  algorithm  for  the  S-l  instance. 


6  Conclusions 

This  paper  integrated  a  least-squares  temporal  difference  learning  method  into  an  efficient  actor- 
critic  architecture  where  the  actor  and  the  critic  updates  are  carried  out  concurrently.  The  conver¬ 
gence  of  the  algorithm  was  established,  and  the  performance  of  the  algorithm  was  demonstrated  by 
solving  a  forklift  dispatching  problem  arising  in  warehouse  management.  For  that  application,  and 
for  instances  of  realistic  size,  we  demonstrated  a  20%  performance  improvement  over  a  heuristic 
similar  to  the  ones  used  in  practice.  We  also  extended  the  applicability  of  actor-critic  algorithms 
to  general  cases  of  POMDPs  where  even  the  cost  is  not  perfectly  observable.  The  algorithms 
presented  in  this  paper  use  policies  that  are  functions  of  immediate  observations.  Although  in  a 
POMDP  setting,  more  complex  policies  such  as  controllers  with  internal  states  can  be  used  ([6,  2]), 
the  actor-critic  methods  can  also  be  used  as  alternatives  to  these  methods  in  learning  finite-state 
controllers  (see  [40]).  In  the  POMDPs  with  indirectly  observed  cost,  the  use  of  such  alternatives 
could  be  an  interesting  direction  for  further  research. 
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Appendix 

A  Proof  of  Theorem  3.1 

We  first  cite  the  theory  of  linear  stochastic  approximation  driven  by  a  slowly  varying  Markov  chain 
[24]  (with  simplifications). 
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Figure  7:  Pick-up  location  data  obtained  by  the  proposed  actor-critic  algorithm  for  the  S-l  instance. 


Let  {yk}  be  a  finite  Markov  chain  whose  transition  probabilities  depend  on  a  parameter  9  G  Mn. 
Consider  a  generic  iteration  of  the  form 

sfc+i  =  sfc  +  7fc(h6)fc(yi;+i)  -  G0fc(yfc+i)sfc)  +  7fcHfcsfe,  (15) 

where  G  Mm,  E&  is  some  rn  x  m- matrix,  and  he(-)  G  G  Mmxm  are  0-parameterized 

vector  and  matrix  functions,  respectively.  This  equation  sets  up  a  structure  of  a  regression  process 
whose  weight  is  also  determined  by  a  stochastic  process.  The  simultaneous-update  actor-critic  we 
are  considering  is  a  special  case.  It  has  been  shown  in  [24]  that  the  critic  in  (15)  converges  if  the 
following  set  of  conditions  are  met. 

Condition  1 

1.  The  sequence  {^k}  is  deterministic,  non-increasing,  and 

J2lk  =  oo,  ^2^1  <oo. 
k  k 

2.  The  random  sequence  { 6 *,}  satisfies  ||0fe+i  — 0fc||  <  fikHk  for  some  process  { H f}  with  bounded 
moments,  where  {fik}  is  a  deterministic  sequence  such  that 

d 

<  oo  for  some  d  >  0. 

3.  3*.  is  an  m  x  m-matrix  valued  martingale  difference  with  bounded  moments. 

4 ■  For  each  0,  there  exist  h(0)  G  Mm;  G(0)  G  Rmxm,  and  corresponding  m-vector  and  m  x  m- 
matrix  functions  M-),  Gfl(0  that  satisfy  the  Poisson  equation.  That  is,  for  each  y, 

My)  =  h0(y)  -  h(6>)  +  (Pehfl)(y), 

Ge(y)  =  Ge(y)  -  G (9)  +  (Pe Ge)(y). 
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Figure  8:  Depot  data  obtained  by  the  proposed  actor-critic  algorithm  for  the  S-l  instance. 


5.  For  some  constant  C  and  for  all  6,  we  have  max(||h(0)||,  ||G(0)||)  <  C. 

6.  For  any  d>  0,  there  exists  Cd>  0  such  that  supfc  E[| |f0fc  (yfc) | |rf]  <  Cd,  where  fg(-)  represents 
any  of  the  functions  he(-),  hg(-),  Gq(-)  and  Gg(-). 

7.  For  some  constant  C  >  0  and  for  all  0,6  G  Mn,  max(||h(0)  —  h(0)||,  ||G(0)  —  G(0)||)  < 
C\\G-G\\. 

8.  There  exists  a  positive  measurable  function  C(-)  such  that  for  every  d  >  0.  supfc  E[C(yfc)d]  < 
00,  and  ||fe(y)  -  fe(y)||  <  G(y)||0  -  6 1|. 

9.  There  exists  a  >  0  such  that  for  all  s  G  Mm  and  6  G  Mn 

s/G(0)s  >  a||s||2. 


If  these  conditions  are  all  met,  then 

lim  sk  =  G(0fc)_1h(0fc). 

k — >00 


For  now,  let’s  focus  on  the  first  two  items  of  Condition  1.  For  any  matrix  A  let  v(A)  be  a  column 
vector  that  stacks  all  row  vectors  of  A  (also  written  as  column  vectors).  Simple  algebra  suggests 
that  the  core  iteration  of  the  LSTD  critic  can  be  written  as  (15)  with 
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Figure  9:  Delay  data  obtained  by  the  proposed  actor-critic  algorithm  for  the  S-l  instance. 
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where  M  is  an  arbitrary  (large)  positive  constant  whose  role  is  to  facilitate  the  convergence  proof. 

The  step-sizes  7 k  and  (3k  hi  (7)  and  (9)  correspond  exactly  to  the  7*,  and  (3k  in  Condition  1.(1) 
and  1.(2),  respectively.  If  the  MDP  has  finite  state  and  action  space,  then  the  conditions  on  {(3k} 
reduce  to  ([24]) 


XM  =  00, 
k 


X^  <  °°> 


lim  —  =  0, 

k^oo  7fc 


(17) 


where  {(3k\  is  a  deterministic  and  non- increasing  sequence.  Note  that  we  can  use  7*,  =  l/k  (cf. 
Condition  1).  The  following  theorem  establishes  the  convergence  of  the  critic. 


Lemma  A.l  [Critic  Convergence]  For  the  LSTD  actor-critic  (7)  and  (8)  with  some  step-size  se¬ 
quence  {(3k}  satisfying  (17),  the  sequence  s*.  is  bounded,  and 


lim  \G(dk)sk  -  h(0fc)|  =  0.  (18) 

fc— »o O 


Proof  :  To  show  that  (15)  converges  with  s,  y,  he(-),  Gg(-)  and  E  substituted  by  (16),  the  condi¬ 
tions  l.(l)-(9)  should  be  checked.  However,  a  comparison  with  the  convergence  proof  for  the  TD(A) 
critic  in  [25]  gives  a  simpler  proof.  Let 

Fe(y)  =  z(V>e(x,u)  -  (Petf0)'(x,  u)). 


While  proving  the  convergence  of  the  TD(A)  critic  operating  concurrently  with  the 
showed  that 
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Figure  10:  Results  obtained  by  algorithms  for  the  larger-scale  problems. 


“  "  L°  zfc(V>efc(xfc+l,Wfc+l)  -  (M/>e)'(xfc>M) 
satisfy  Condition  l.(3)-l(8).  In  our  case,  (16)  can  be  rewritten  as 
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Note  that  h#,  G g  and  S  are  linear  functions  of  hg(y),  Fe(y),  z,  and  E,  thus  share  their  convergence 
properties.  It  is  clear  then  that  they  also  satisfy  satisfy  Conditions  l.(3)-l(8).  So,  h0(-),Ge(-)  and 
Efc  also  satisfy  Condition  l.(3)-l(8).  Meanwhile,  the  step-size  {7*,}  satisfies  Condition  1.(1),  and 
the  step-size  {flk}  satisfies  Eq.  (17)  (which  is  as  explained  above  implies  Condition  1.(2)).  Now, 
only  Condition  1(9)  remains  to  be  checked.  To  that  end,  note  that  all  diagonal  elements  of  Gg(y) 
equal  to  one,  and  the  only  off-diagonal  block  is  a  bounded  matrix  divided  by  M.  So,  Gg(y)  is 
positive  definite  if  M,  which  can  be  set  arbitrarily,  is  large  enough.  Thus,  Condition  1(9)  also 
holds.  This  proves  the  convergence. 

Last,  to  show  that  Equation  (18)  is  satisfied,  one  only  needs  to  invoke  the  same  correspondence 
and  the  result  in  [25]. 


Theorem  3.1  follows  by  setting  <f>g  =  i\}g  and  following  the  proof  in  Section  6  of  [25]. 


B  Proof  of  Theorem  4.1 


Under  the  ergodicity  conditions  mentioned  earlier,  {x^,}  and  are  Markov  chains  with  sta¬ 

tionary  distributions  under  any  policy  6.  Let  7rg(x)  and  r/e(x,  u)  denote  the  stationary  distributions 
of  {xfc}  and  {xfc,Ufc},  respectively.  Since  the  observations  are  generated  directly  from  {x^.},  {y^} 
and  {yfc,«fc}  also  have  stationary  distributions  (denoted  by  7Tg(y)  and  rj0(y,u ),  respectively).  The 
average  cost  proxy  can  be  rewritten  as 
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Let  Po(u |x)  =  Po(u\y)p(y\x) ■  Then  the  second  term  of  Eq.  (20)  reduces  to 


which  is  a  term  readily  found  in  the  average  true  cost.  Since  Yhu  Po(u |y)  =  L  the  first  term  of  Eq. 
(20)  equals 
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Note  that  for  the  marginal  distribution  of  y®  given  x  it  holds 
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The  right  hand  side  of  Eq.  (21)  becomes 
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(22) 


From  the  definition  of  v^(l)  (i) , 


EMy<'»|x‘'V®,yW 


thus,  the  right  hand  side  of  (22)  equals 


1,  if^)=x(0, 

0,  otherwise, 
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Now  both  terms  in  Eq.  (20)  have  been  calculated  and  we  conclude 
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